Template-Based Information Mining from HTML Documents
Abstract
Tools for mining information from data can create added value for the Internet. As the majority of electronic documents available over the network are in unstructured textual form, extracting useful information from a document usually involves information retrieval techniques or manual processing. This paper presents a novel approach to mining information from HTML documents using tree-structured templates. In addition to syntactic and semantic descriptions, each template is designed to capture the logical structure of a class of documents. Experiments have been conducted to extract FAQ information automatically from over one hundred HTML documents collected from the Web. Using two basic templates, the prototype FAQ Miner has accurately analyzed 65% of the collection of FAQ documents. With additional processing to handle “near-passes”, the success rate is approximately 75%. The preliminary results have demonstrated the utility of structural templates for mining information from semi-structured text-based documents.
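To make the idea of a tree-structured template concrete, the following is a minimal sketch and not the FAQ Miner's actual design, which the abstract does not describe. It assumes a template is a container tag whose children alternate between a question tag and an answer tag (the syntactic and structural part), plus a check that each question ends with a question mark (a stand-in for a semantic description). The names `Node`, `Template`, `TreeBuilder`, `match`, and `mine`, and the two templates in `TEMPLATES`, are all hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    text: str = ""


@dataclass
class Template:
    """Hypothetical template: a container of alternating question/answer tags."""
    container: str
    question: str
    answer: str


class TreeBuilder(HTMLParser):
    """Build a simple element tree from reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data + " "


def full_text(node):
    """Whitespace-normalized text of a node and its descendants."""
    parts = [node.text] + [full_text(child) for child in node.children]
    return " ".join(" ".join(parts).split())


def match(node, tpl):
    """Return (question, answer) pairs if the node fits the template, else None."""
    kids = node.children
    if node.tag != tpl.container or not kids or len(kids) % 2:
        return None
    pairs = []
    for q, a in zip(kids[0::2], kids[1::2]):
        if q.tag != tpl.question or a.tag != tpl.answer:
            return None
        question = full_text(q)
        if not question.endswith("?"):  # assumed semantic check
            return None
        pairs.append((question, full_text(a)))
    return pairs


def mine(html, templates):
    """Breadth-first search for the first subtree that matches any template."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    pending = [builder.root]
    while pending:
        node = pending.pop(0)
        for tpl in templates:
            pairs = match(node, tpl)
            if pairs:
                return pairs
        pending.extend(node.children)
    return []


# Two illustrative templates (assumed, not taken from the paper).
TEMPLATES = [Template("dl", "dt", "dd"), Template("div", "h3", "p")]
```

Calling `mine(page_html, TEMPLATES)` returns a list of question and answer pairs, or an empty list when no subtree fits. A strict matcher like this one rejects any document that deviates from the template even slightly, which is the kind of case the abstract's near-pass processing is meant to recover.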